- The Data Science Process
- The Case
- The Tidyverse
Tuesday, August 28th, 2018



R!
R and the Tidyverse
- R without any packages.
- R is a bad programming language.
- If you have prior experience in R and did not begin all your scripts with library(dplyr)….
R and the Tidyverse


- x %>% f() \(\Longleftrightarrow\) f(x)
  "x, and then do f to it"
- x %>% f(y) \(\Longleftrightarrow\) f(x,y)
- x %>% f(y) %>% g(z) \(\Longleftrightarrow\) g(f(x,y),z)
  "x, then do f with option y, then do g with option z…"
# familiar
listings %>% glimpse() # = glimpse(listings)
listings %>% head() # = head(listings)
listings %>% colnames() # = colnames(listings)
# get all columns with "review_scores" in the column name
listings %>% select(contains('review_scores'))
# what should this return?
listings %>% select(contains('review_scores')) %>% colnames()
# compare: colnames(select(listings, contains('review_scores')))
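The pipe equivalences above can be checked on a small made-up table. This is only a sketch: toy_listings and its values are invented for illustration and are not the workshop's listings data.

```r
library(dplyr)

# Hypothetical stand-in for the listings table (values are made up)
toy_listings <- tibble(
  id = 1:3,
  review_scores_rating = c(95, 80, 88),
  review_scores_value  = c(9, 8, 10),
  price = c("$100.00", "$85.00", "$1,200.00")
)

# x %>% f() is the same call as f(x)
piped_names  <- toy_listings %>% colnames()
nested_names <- colnames(toy_listings)

# x %>% f(y) %>% g() is the same call as g(f(x, y))
piped_scores  <- toy_listings %>% select(contains("review_scores")) %>% colnames()
nested_scores <- colnames(select(toy_listings, contains("review_scores")))

identical(piped_names, nested_names)    # TRUE
identical(piped_scores, nested_scores)  # TRUE
piped_scores
```

The piped form reads left to right in the order the steps happen; the nested form has to be read inside out.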
Let's try this out – back to the case study!

- data %>% mutate(new_col = formula(old_col1, old_col2)) creates new columns.
- data %>% group_by(col) groups data for breakout summaries.
- data %>% summarise(measure = formula(col1, col2)) computes summaries.
- data %>% group_by(col) %>% summarise(measure = formula(col1, col2)) computes breakout summaries.
- filter and summarise
- joining data
calendar %>%
  summarise(earliest = min(date),
            latest = max(date))
## # A tibble: 1 x 2
##   earliest   latest
##   <date>     <date>
## 1 2016-09-06 2017-09-05
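To see mutate, group_by, and summarise working together, here is a minimal sketch on an invented table. The column names listing_id and available, and the "t"/"f" coding, are assumptions made for the example; check them against the real calendar table before reusing the code.

```r
library(dplyr)

# Hypothetical miniature calendar table (values are made up)
toy_calendar <- tibble(
  listing_id = c(1, 1, 2, 2, 2),
  date = as.Date(c("2017-01-01", "2017-01-02",
                   "2017-01-01", "2017-01-02", "2017-01-03")),
  available = c("t", "f", "t", "t", "f")
)

# mutate: add a logical column derived from an existing one
# group_by + summarise: collapse to one summary row per listing
toy_summary <- toy_calendar %>%
  mutate(is_available = available == "t") %>%
  group_by(listing_id) %>%
  summarise(days_listed = n(),
            days_available = sum(is_available),
            latest = max(date))

toy_summary
```

Without the group_by step, summarise would return a single row for the whole table, as in the earliest/latest example above.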
But some of these listings may be "zombies" without recent availability. How can we include only listings with availability from a certain time period?
- calendar table (exercise)
- join that information to the listings table (together)
The information we need is distributed between two tables – how can we get there?
We need a key column that tells us which calendar rows correspond to which listings.
listings$id corresponds to calendar$listing_id
join
The join family of functions lets us add columns from one table to another using a key.
- x %>% left_join(y): most common; keeps all rows of x but not necessarily y.
- x %>% right_join(y): keeps all rows of y but not necessarily x.
- x %>% full_join(y): keeps all rows of both x and y.
- x %>% inner_join(y): keeps only rows of x that match in y and vice versa.
We'll use left_join for this case. On to the exercise!
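The row-keeping behaviour of dplyr's join functions can be seen on two small made-up tables. This is a sketch only: the tables x and y, and the shared key name id, are invented for the example.

```r
library(dplyr)

# Hypothetical tables; the key column is named `id` in both for simplicity
x <- tibble(id = c(1, 2, 3), name = c("a", "b", "c"))
y <- tibble(id = c(2, 3, 4), days_available = c(10, 0, 25))

left  <- x %>% left_join(y, by = "id")   # ids 1, 2, 3 (all rows of x)
right <- x %>% right_join(y, by = "id")  # ids 2, 3, 4 (all rows of y)
full  <- x %>% full_join(y, by = "id")   # ids 1, 2, 3, 4 (all rows of both)
inner <- x %>% inner_join(y, by = "id")  # ids 2, 3 (matches only)

# Rows of x with no match in y get NA in the columns that come from y
left
```

When the key columns have different names in the two tables, dplyr accepts a named vector such as by = c("id" = "listing_id"); whether those are the right names for the workshop tables is an assumption to verify.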
ggplot2
Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design. Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.
– Edward Tufte
A grammar is a set of components (ingredients) that you can combine to create new things. Many grammars have required components: if you're missing one, you're doing it wrong. In baking….

- gg in ggplot2.
- tidyverse

- Data: almost always a data_frame
- Aesthetic mapping: relation of data to chart components.
- Geometry: specific visualization type? E.g. line, bar, heatmap?
- Statistical transformation: how should the data be transformed or aggregated before visualizing?
- Theme: how should the non-data parts of the plot look?
(+ plays the same role in ggplot2 that %>% does in data manipulation.)
Does getting lots of reviews usually mean you get good reviews?
listings %>%
  ggplot()

listings %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating)

listings %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point()

listings %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2)

listings %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2) +
  theme_bw()

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2) +
  theme_bw()

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2) +
  theme_bw() +
  labs(x = 'Number of Reviews', y = 'Review Score', title = 'Review Volume and Review Quality')

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating) +
  geom_point(alpha = .2, color = 'firebrick') +
  theme_bw() +
  labs(x = 'Number of Reviews', y = 'Review Score', title = 'Review Volume and Review Quality')

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = review_scores_value,
      y = review_scores_location,
      size = number_of_reviews) +
  geom_point(alpha = .2, color = 'firebrick') +
  theme_bw()

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = review_scores_value,
      y = review_scores_location,
      fill = number_of_reviews) +
  geom_tile() +
  theme_bw()
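The grammar components listed earlier (data, aesthetic mapping, geometry, statistical transformation, theme) can be labelled on a single plot. The sketch below uses an invented table, toy_listings, rather than the workshop data, and stores the plot in an object so it can be inspected or printed later.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data, made up for illustration
toy_listings <- tibble(
  number_of_reviews = c(3, 12, 40, 75, 150),
  review_scores_rating = c(100, 92, 95, 90, 94)
)

p <- toy_listings %>%                      # data
  filter(number_of_reviews < 100) %>%      # data manipulation uses %>%
  ggplot() +                               # plot layers use +
  aes(x = number_of_reviews,               # aesthetic mapping
      y = review_scores_rating) +
  geom_point(alpha = .5) +                 # geometry (its default stat is identity)
  theme_bw() +                             # theme
  labs(x = "Number of Reviews", y = "Review Score")

p  # printing the object draws the plot
```

Swapping the geometry or the theme changes one line and leaves the rest of the recipe alone, which is the practical payoff of thinking in components.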
The following code computes the average price of all listings on each day in the data set:
average_price_table <- calendar %>%
  mutate(price = price %>% gsub('\\$|,', '', .) %>% as.numeric()) %>%
  group_by(date) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))
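The price cleaning step is easier to follow in isolation. A minimal sketch, assuming prices are stored as text such as "$1,250.00" (the example strings are invented; the pattern is the one used above):

```r
# Hypothetical price strings in the "$1,234.00" style
raw_price <- c("$85.00", "$1,250.00", NA, "$40.00")

# '\\$|,' matches a literal dollar sign or a comma; gsub deletes every match
clean_price <- as.numeric(gsub("\\$|,", "", raw_price))

clean_price                      # 85 1250 NA 40
mean(clean_price, na.rm = TRUE)  # 458.3333
```

The na.rm = TRUE argument matters: without it, a single missing price would make the mean for that day NA.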
Use geom_line() to visualize these prices with time on the x-axis and price on the y-axis.
average_price_table %>%
  ggplot() +
  aes(x = date, y = mean_price) +
  geom_line()
Using the summary_table object you created earlier, make a bar chart showing the number of apartments by neighbourhood. In this case, the correct geom to use is geom_bar(stat = 'identity').
summary_table %>%
  filter(property_type == 'Apartment') %>%
  ggplot() +
  aes(x = neighbourhood, y = n) +
  geom_bar(stat = 'identity')

summary_table %>%
  filter(property_type == 'Apartment') %>%
  ggplot() +
  aes(x = reorder(neighbourhood, n), y = n) +
  coord_flip() +
  geom_bar(stat = 'identity')

summary_table %>%
  ggplot() +
  aes(x = reorder(neighbourhood, n), y = n, fill = property_type) +
  coord_flip() +
  geom_bar(stat = 'identity')
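Two details in the bar charts above are worth isolating: reorder() sorts the bars, and stat = 'identity' tells geom_bar to use the y values as given instead of counting rows. The sketch below uses invented counts in toy_summary (not the summary_table from the exercise) and geom_col(), which ggplot2 provides as a shorthand for geom_bar(stat = 'identity').

```r
library(dplyr)
library(ggplot2)

# Hypothetical counts; the real summary_table comes from the earlier exercise
toy_summary <- tibble(
  neighbourhood = c("A", "B", "C"),
  n = c(30, 5, 12)
)

# reorder() turns neighbourhood into a factor whose levels are sorted by n
levels(reorder(toy_summary$neighbourhood, toy_summary$n))  # "B" "C" "A"

# geom_col() takes bar heights straight from y, like geom_bar(stat = 'identity')
p <- toy_summary %>%
  ggplot() +
  aes(x = reorder(neighbourhood, n), y = n) +
  geom_col() +
  coord_flip()

p
```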
listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating, color = property_type) +
  geom_point(alpha = .5) +
  theme_bw() +
  labs(x = 'Number of Reviews', y = 'Review Score', title = 'Review Volume and Review Quality')

listings %>%
  filter(number_of_reviews < 100) %>%
  ggplot() +
  aes(x = number_of_reviews, y = review_scores_rating, color = property_type) +
  geom_point(alpha = .5) +
  theme_bw() +
  facet_wrap(~property_type) +
  labs(x = 'Number of Reviews', y = 'Review Score', title = 'Review Volume and Review Quality')

listings %>%
  select(number_of_reviews, contains("review_scores"), -review_scores_rating) %>%
  gather(key = type, value = score, -number_of_reviews) %>%
  ggplot() +
  aes(x = factor(score), y = number_of_reviews) +
  geom_boxplot() +
  facet_wrap(~type)
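The gather step in the last plot reshapes the data from wide (one column per score type) to long (one row per listing and score type), which is what lets facet_wrap(~type) split the panels. A minimal sketch on an invented table, toy_scores:

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per score type (values are made up)
toy_scores <- tibble(
  number_of_reviews = c(10, 20),
  review_scores_value = c(9, 10),
  review_scores_location = c(8, 9)
)

# gather() stacks every column except number_of_reviews into key/value pairs:
# the old column names go into `type`, the old cell values into `score`
toy_long <- toy_scores %>%
  gather(key = type, value = score, -number_of_reviews)

toy_long  # 4 rows: one per (listing, score type) combination
```

Newer versions of tidyr also offer pivot_longer() for the same kind of reshape; gather() is kept here because it is what the slides use.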
- Open wrangle_viz/dashboard.Rmd
- Click the knit button at the top of RStudio and observe the result. If you see a dashboard, then you are good to go.
- author metadata up top
- R "code chunks" that begin with ```{r}
- knit your dashboard again. Save the .Rmd file. We're coming back to it this afternoon!
- R graphics cheatsheet, R Graphics Cookbook
- ggplot2 Cheatsheet
- R Graphics Cookbook, by Winston Chang
- R
- R
- R packages
- R, statistics, and data science at FiveThirtyEight (they use R!)